Loading the library:
library("metacell")
library("ggplot2")
To start using MetaCell, you first initialize a database. This is not much more than linking the package to directory that stores all your objects.
In our case we will initialize the database to the saved_work directory:
if(!dir.exists("saved_work")) dir.create("saved_work/")
scdb_init("saved_work/", force_reinit=T)
#> initializing scdb to saved_work/
tgconfig::override_params("annotations/astro_params.yaml",package = "metacell")
src_functions = "../sc_aux_functions.R"
force_reinit=T instruct the system to override existing database objects. This can be important if you are running in cycles and would like to update your objects. Otherwise, the database reuses loaded objects to save time on reading and initializing them from the disk.
tgconfig::override_params() overrides default metacell configurations.
Before starting to analyze the data, we link the package to a figure directory:
if(!dir.exists("results")) dir.create("results/")
scfigs_init("results/")
We will read multiple MARS umi matrices (umi.tab) and merge them, based on a table defining the datasets
@param mat_nm defines the ID of the matrix object (and is going to be the name of all the objects from now on)
@param base_dir defines the umitab directory
@param mat_nm defines the name (id) of matrix
@param datasets_table_fn defines the index file of the MARS multi batch dataset. This is a tab delimited text file, with an arbitrary number of columns and a header line.
The three mandatory fields are:
Amp.Batch.ID - specify the ID of the batch defined by the row, and also the file name (without the .txt suffix) of the respective umi table in the base_dir provided.
Seq.Batch.ID - efines and ID of the sequencing batch (may be relevant for further noise cleanups beyond those done in the low-level pipeline).
Batch.Set.ID - The third id group different batches into sets for downstream analysis (e.g. QC and more).
index_fn = "annotations/080119_reseq.txt"
id = "reseq"
mc_id = id
Let us take a look at our index file:
| Amp.Batch.ID | Seq.Batch.ID | Batch.Set.ID | Protocol_version_ID | Pool_barcode | R2_design | Experiment_ID | Species | Empty_wells |
|---|---|---|---|---|---|---|---|---|
| AB1584 | SB78 | 552_kit | SPID V0.9 | TGCATGCA | 7W.8R | PDS_kit | mouse | O1,O2,P1,P2 |
| AB1585 | SB78 | 552_kit | SPID V0.9 | AAGGCGAT | 7W.8R | PDS_kit | mouse | O1,O2,P1,P2 |
| AB1586 | SB78 | 552_kit | SPID V0.9 | GTGTCTGA | 7W.8R | PDS_kit | mouse | O1,O2,P1,P2 |
| AB1587 | SB78 | 558_kit | SPID V0.9 | GCGGATAT | 7W.8R | PDS_kit | mouse | O1,O2,P1,P2 |
| AB1588 | SB78 | 558_kit | SPID V0.9 | TACAACGC | 7W.8R | PDS_kit | mouse | O1,O2,P1,P2 |
| AB1589 | SB78 | 558_kit | SPID V0.9 | ATTGCGAC | 7W.8R | PDS_kit | mouse | O1,O2,P1,P2 |
Let’s load a matrix to the system:
umi.tab_dir = "/home/labs/amit/eyald/sc_pipeline/scdb_v4_mouse/output/umi.tab/"
mcell_import_multi_mars(mat_nm = id, dataset_table_fn = index_fn, base_dir = umi.tab_dir, force = T)
#> will read AB1584
#> will read AB1585
#> will read AB1586
#> will read AB1587
#> will read AB1588
#> will read AB1589
#> [1] TRUE
mat = scdb_mat(id)
print(dim(mat@mat))
#> [1] 52634 2304
The scdb_mat() command returns a matrix object, which has one slot containing the count matrix (mat@mat), as well as additional features we will mention below.
MetaCell uses a standardized naming scheme for the figures, to make it easier to archive and link analysis figures to the database objects.
In principle, figures in the figures directory are named after the object data type they refer to (for example, mat for matrices, mc for metacells, and more, see below).
The figure name then includes also the object name they refer to, and a suffix describing the actual figure type.
To get a basic understanding of the new data, we will plot the distribution of UMI count per cell (the plot is thresholded after 500 umi counts):
mcell_plot_umis_per_cell(id,min_umis_cutoff = 500)
#> [1] 500
Umi distribution plot
We want to clean some known issues from the matrix before starting to work with it.
We generate a list of mitochondrial genes that typically mark cells as being stressed or dying, as well as immunoglobulin genes that may represent strong clonal signatures in plasma cells, rather than cellular identity.
mat = scdb_mat(id)
nms = c(rownames(mat@mat), rownames(mat@ignore_gmat))
ig_genes = c(grep("^Igj", nms, v=T),
grep("^Igh",nms,v=T),
grep("^Igk", nms, v=T),
grep("^Igl", nms, v=T))
bad_genes = unique(c(grep("ERCC|^mt", nms, v=T), ig_genes))
bad_genes
#> [1] "ERCC-00002" "ERCC-00003" "ERCC-00004" "ERCC-00009"
#> [5] "ERCC-00012" "ERCC-00013" "ERCC-00014" "ERCC-00016"
#> [9] "ERCC-00017" "ERCC-00019" "ERCC-00022" "ERCC-00024"
#> [13] "ERCC-00025" "ERCC-00028" "ERCC-00031" "ERCC-00033"
#> [17] "ERCC-00034" "ERCC-00035" "ERCC-00039" "ERCC-00040"
#> [21] "ERCC-00041" "ERCC-00042" "ERCC-00043" "ERCC-00044"
#> [25] "ERCC-00046" "ERCC-00048" "ERCC-00051" "ERCC-00053"
#> [29] "ERCC-00054" "ERCC-00057" "ERCC-00058" "ERCC-00059"
#> [33] "ERCC-00060" "ERCC-00061" "ERCC-00062" "ERCC-00067"
#> [37] "ERCC-00069" "ERCC-00071" "ERCC-00073" "ERCC-00074"
#> [41] "ERCC-00075" "ERCC-00076" "ERCC-00077" "ERCC-00078"
#> [45] "ERCC-00079" "ERCC-00081" "ERCC-00083" "ERCC-00084"
#> [49] "ERCC-00085" "ERCC-00086" "ERCC-00092" "ERCC-00095"
#> [53] "ERCC-00096" "ERCC-00097" "ERCC-00098" "ERCC-00099"
#> [57] "ERCC-00104" "ERCC-00108" "ERCC-00109" "ERCC-00111"
#> [61] "ERCC-00112" "ERCC-00113" "ERCC-00116" "ERCC-00117"
#> [65] "ERCC-00120" "ERCC-00123" "ERCC-00126" "ERCC-00130"
#> [69] "ERCC-00131" "ERCC-00134" "ERCC-00136" "ERCC-00137"
#> [73] "ERCC-00138" "ERCC-00142" "ERCC-00143" "ERCC-00144"
#> [77] "ERCC-00145" "ERCC-00147" "ERCC-00148" "ERCC-00150"
#> [81] "ERCC-00154" "ERCC-00156" "ERCC-00157" "ERCC-00158"
#> [85] "ERCC-00160" "ERCC-00162" "ERCC-00163" "ERCC-00164"
#> [89] "ERCC-00165" "ERCC-00168" "ERCC-00170" "ERCC-00171"
#> [93] "mt-Atp6" "mt-Atp8" "mt-Co1" "mt-Co2"
#> [97] "mt-Co3" "mt-Cytb" "mt-Nd1" "mt-Nd2"
#> [101] "mt-Nd3" "mt-Nd4" "mt-Nd4l" "mt-Nd5"
#> [105] "mt-Nd6" "mt-Rnr1" "mt-Rnr2" "mt-Ta"
#> [109] "mt-Tc" "mt-Td" "mt-Te" "mt-Tf"
#> [113] "mt-Tg" "mt-Th" "mt-Ti" "mt-Tk"
#> [117] "mt-Tl1" "mt-Tl2" "mt-Tm" "mt-Tn"
#> [121] "mt-Tp" "mt-Tq" "mt-Tr" "mt-Ts1"
#> [125] "mt-Ts2" "mt-Tt" "mt-Tv" "mt-Tw"
#> [129] "mt-Ty" "Igha" "Ighd" "Ighd1-1"
#> [133] "Ighd2-3" "Ighd2-4" "Ighd2-5" "Ighd2-6"
#> [137] "Ighd2-7" "Ighd2-8" "Ighd3-1" "Ighd3-2"
#> [141] "Ighd4-1" "Ighd5-1" "Ighd5-2" "Ighd5-3"
#> [145] "Ighd5-4" "Ighd5-5" "Ighd5-6" "Ighd5-7"
#> [149] "Ighd5-8" "Ighd6-1" "Ighd6-2" "Ighe"
#> [153] "Ighg1" "Ighg2b" "Ighg2c" "Ighg3"
#> [157] "Ighj1" "Ighj2" "Ighj3" "Ighj4"
#> [161] "Ighm" "Ighmbp2" "Ighv1-1" "Ighv1-10"
#> [165] "Ighv1-11" "Ighv1-12" "Ighv1-13" "Ighv1-14"
#> [169] "Ighv1-15" "Ighv1-16" "Ighv1-17" "Ighv1-17-1"
#> [173] "Ighv1-18" "Ighv1-19" "Ighv1-19-1" "Ighv1-2"
#> [177] "Ighv1-20" "Ighv1-21" "Ighv1-21-1" "Ighv1-22"
#> [181] "Ighv1-23" "Ighv1-24" "Ighv1-25" "Ighv1-26"
#> [185] "Ighv1-27" "Ighv1-28" "Ighv1-29" "Ighv1-3"
#> [189] "Ighv1-30" "Ighv1-31" "Ighv1-32" "Ighv1-33"
#> [193] "Ighv1-34" "Ighv1-35" "Ighv1-36" "Ighv1-37"
#> [197] "Ighv1-38" "Ighv1-39" "Ighv1-4" "Ighv1-40"
#> [201] "Ighv1-41" "Ighv1-42" "Ighv1-43" "Ighv1-44"
#> [205] "Ighv1-45" "Ighv1-46" "Ighv1-47" "Ighv1-48"
#> [209] "Ighv1-49" "Ighv1-5" "Ighv1-50" "Ighv1-51"
#> [213] "Ighv1-52" "Ighv1-53" "Ighv1-54" "Ighv1-55"
#> [217] "Ighv1-56" "Ighv1-57" "Ighv1-58" "Ighv1-59"
#> [221] "Ighv1-6" "Ighv1-60" "Ighv1-61" "Ighv1-62"
#> [225] "Ighv1-62-1" "Ighv1-62-2" "Ighv1-62-3" "Ighv1-63"
#> [229] "Ighv1-64" "Ighv1-65" "Ighv1-66" "Ighv1-67"
#> [233] "Ighv1-68" "Ighv1-69" "Ighv1-7" "Ighv1-70"
#> [237] "Ighv1-71" "Ighv1-72" "Ighv1-73" "Ighv1-74"
#> [241] "Ighv1-75" "Ighv1-76" "Ighv1-77" "Ighv1-78"
#> [245] "Ighv1-79" "Ighv1-8" "Ighv1-80" "Ighv1-81"
#> [249] "Ighv1-82" "Ighv1-83" "Ighv1-84" "Ighv1-85"
#> [253] "Ighv1-86" "Ighv1-9" "Ighv10-1" "Ighv10-2"
#> [257] "Ighv10-3" "Ighv10-4" "Ighv11-1" "Ighv11-2"
#> [261] "Ighv12-1" "Ighv12-2" "Ighv12-3" "Ighv13-1"
#> [265] "Ighv13-2" "Ighv14-1" "Ighv14-2" "Ighv14-3"
#> [269] "Ighv14-4" "Ighv15-1" "Ighv15-2" "Ighv16-1"
#> [273] "Ighv2-1" "Ighv2-2" "Ighv2-3" "Ighv2-4"
#> [277] "Ighv2-5" "Ighv2-6" "Ighv2-6-8" "Ighv2-7"
#> [281] "Ighv2-8" "Ighv2-9" "Ighv2-9-1" "Ighv3-1"
#> [285] "Ighv3-2" "Ighv3-3" "Ighv3-4" "Ighv3-5"
#> [289] "Ighv3-6" "Ighv3-7" "Ighv3-8" "Ighv4-1"
#> [293] "Ighv4-2" "Ighv5-1" "Ighv5-10" "Ighv5-11"
#> [297] "Ighv5-12" "Ighv5-12-4" "Ighv5-13" "Ighv5-15"
#> [301] "Ighv5-16" "Ighv5-17" "Ighv5-18" "Ighv5-19"
#> [305] "Ighv5-2" "Ighv5-21" "Ighv5-3" "Ighv5-4"
#> [309] "Ighv5-5" "Ighv5-6" "Ighv5-7" "Ighv5-8"
#> [313] "Ighv5-9" "Ighv5-9-1" "Ighv6-1" "Ighv6-2"
#> [317] "Ighv6-3" "Ighv6-4" "Ighv6-5" "Ighv6-6"
#> [321] "Ighv6-7" "Ighv7-1" "Ighv7-2" "Ighv7-3"
#> [325] "Ighv7-4" "Ighv8-1" "Ighv8-10" "Ighv8-11"
#> [329] "Ighv8-12" "Ighv8-13" "Ighv8-14" "Ighv8-15"
#> [333] "Ighv8-16" "Ighv8-2" "Ighv8-3" "Ighv8-4"
#> [337] "Ighv8-5" "Ighv8-6" "Ighv8-7" "Ighv8-8"
#> [341] "Ighv8-9" "Ighv9-1" "Ighv9-2" "Ighv9-3"
#> [345] "Ighv9-4" "Igkc" "Igkj1" "Igkj2"
#> [349] "Igkj3" "Igkj4" "Igkj5" "Igkv1-108"
#> [353] "Igkv1-110" "Igkv1-115" "Igkv1-117" "Igkv1-122"
#> [357] "Igkv1-131" "Igkv1-132" "Igkv1-133" "Igkv1-135"
#> [361] "Igkv1-136" "Igkv1-35" "Igkv1-88" "Igkv1-99"
#> [365] "Igkv10-94" "Igkv10-95" "Igkv10-96" "Igkv11-106"
#> [369] "Igkv11-114" "Igkv11-118" "Igkv11-125" "Igkv12-38"
#> [373] "Igkv12-40" "Igkv12-41" "Igkv12-42" "Igkv12-44"
#> [377] "Igkv12-46" "Igkv12-47" "Igkv12-49" "Igkv12-66"
#> [381] "Igkv12-67" "Igkv12-89" "Igkv12-98" "Igkv13-54-1"
#> [385] "Igkv13-55-1" "Igkv13-56-1" "Igkv13-57-1" "Igkv13-57-2"
#> [389] "Igkv13-61-1" "Igkv13-62-1" "Igkv13-64" "Igkv13-71-1"
#> [393] "Igkv13-73-1" "Igkv13-74-1" "Igkv13-76" "Igkv13-78-1"
#> [397] "Igkv13-80-1" "Igkv13-82" "Igkv13-84" "Igkv13-85"
#> [401] "Igkv13-87" "Igkv14-100" "Igkv14-111" "Igkv14-118-1"
#> [405] "Igkv14-118-2" "Igkv14-126" "Igkv14-126-1" "Igkv14-130"
#> [409] "Igkv14-134-1" "Igkv15-101" "Igkv15-101-1" "Igkv15-102"
#> [413] "Igkv15-103" "Igkv15-97" "Igkv16-104" "Igkv17-121"
#> [417] "Igkv17-127" "Igkv17-134" "Igkv18-36" "Igkv19-93"
#> [421] "Igkv2-105" "Igkv2-107" "Igkv2-109" "Igkv2-112"
#> [425] "Igkv2-113" "Igkv2-116" "Igkv2-137" "Igkv2-93-1"
#> [429] "Igkv2-95-1" "Igkv2-95-2" "Igkv20-101-2" "Igkv3-1"
#> [433] "Igkv3-10" "Igkv3-11" "Igkv3-12" "Igkv3-12-1"
#> [437] "Igkv3-2" "Igkv3-3" "Igkv3-4" "Igkv3-5"
#> [441] "Igkv3-6" "Igkv3-7" "Igkv3-8" "Igkv3-9"
#> [445] "Igkv4-50" "Igkv4-51" "Igkv4-53" "Igkv4-54"
#> [449] "Igkv4-55" "Igkv4-56" "Igkv4-57" "Igkv4-57-1"
#> [453] "Igkv4-58" "Igkv4-59" "Igkv4-60" "Igkv4-61"
#> [457] "Igkv4-62" "Igkv4-63" "Igkv4-65" "Igkv4-68"
#> [461] "Igkv4-69" "Igkv4-70" "Igkv4-71" "Igkv4-72"
#> [465] "Igkv4-73" "Igkv4-74" "Igkv4-75" "Igkv4-77"
#> [469] "Igkv4-78" "Igkv4-79" "Igkv4-80" "Igkv4-81"
#> [473] "Igkv4-83" "Igkv4-86" "Igkv4-90" "Igkv4-91"
#> [477] "Igkv4-92" "Igkv5-37" "Igkv5-39" "Igkv5-40-1"
#> [481] "Igkv5-43" "Igkv5-45" "Igkv5-48" "Igkv6-13"
#> [485] "Igkv6-14" "Igkv6-15" "Igkv6-17" "Igkv6-20"
#> [489] "Igkv6-23" "Igkv6-25" "Igkv6-29" "Igkv6-32"
#> [493] "Igkv7-33" "Igkv8-16" "Igkv8-18" "Igkv8-19"
#> [497] "Igkv8-21" "Igkv8-22" "Igkv8-24" "Igkv8-26"
#> [501] "Igkv8-27" "Igkv8-28" "Igkv8-30" "Igkv8-31"
#> [505] "Igkv8-34" "Igkv9-119" "Igkv9-120" "Igkv9-123"
#> [509] "Igkv9-124" "Igkv9-128" "Igkv9-129" "Iglc1"
#> [513] "Iglc2" "Iglc3" "Iglc4" "Iglj1"
#> [517] "Iglj2" "Iglj3" "Iglj3p" "Iglj4"
#> [521] "Igll1" "Iglon5" "Iglv1" "Iglv2"
#> [525] "Iglv3"
We will next ask the package to ignore the above genes:
mcell_mat_ignore_genes(new_mat_id=id, mat_id=id, bad_genes, reverse=F)
Ignored genes are kept in the matrix for reference, but all downstream analysis will disregard them.
This means that the number of UMIs from these genes cannot be used to distinguish between cells.
In the current example we will also eliminate cells with less than 400 UMIs (threshold can be set based on examination of the UMI count distribution):
mcell_mat_ignore_small_cells(id, id, 500)
Note that filtering decisions can be iteratively modified given results of the downstream analysis.
We move on to computing statistics on the distributions of each gene in the data, which are going to be our main tool for selecting feature genes for MetaCell analysis:
mcell_add_gene_stat(gstat_id=id, mat_id=id, force=T)
#> Calculating gene statistics...
#> will downsamp
#> done downsamp
#> will gen mat_n
#> done gen mat_n
#> done computing basic gstat, will compute trends
#> ..done
This generates a new object of type gstat under the name gstatreseq, by analyzing the count matrix matreseq. We can explore interesting genes and their distributions:
gstat = scdb_gstat(id)
print(head(gstat))
#> name tot var is_on_count sz_cor sz_cor_norm
#> 0610009B22Rik 0610009B22Rik 1226 10.2964360 96 0.222 0.029
#> 0610009E02Rik 0610009E02Rik 26 0.0964637 4 0.029 -0.006
#> 0610009L18Rik 0610009L18Rik 53 0.4819485 4 0.115 0.055
#> 0610009O20Rik 0610009O20Rik 69 0.3305751 11 0.046 -0.005
#> 0610010F05Rik 0610010F05Rik 56 0.2837682 8 0.019 -0.036
#> 0610010K14Rik 0610010K14Rik 395 2.1891578 35 0.084 -0.037
#> niche_stat niche_norm n_mean ds_top1 ds_top2 ds_top3
#> 0610009B22Rik 1 0 0.2043186 12 10 8
#> 0610009E02Rik 1 0 0.0044853 7 1 1
#> 0610009L18Rik 1 0 0.0076623 2 2 2
#> 0610009O20Rik 1 0 0.0158533 3 3 3
#> 0610010F05Rik 1 0 0.0118414 4 2 1
#> 0610010K14Rik 1 0 0.0896102 10 8 8
#> ds_mean ds_var ds_log_varmean ds_vm_norm
#> 0610009B22Rik 0.1533413 0.6899052 2.1315162 -0.07945580
#> 0610009E02Rik 0.0059666 0.0310092 1.6863442 0.94851422
#> 0610009L18Rik 0.0041766 0.0077437 0.4599985 -0.02665584
#> 0610009O20Rik 0.0107399 0.0225712 0.7962717 -0.40928570
#> 0610010F05Rik 0.0071599 0.0154711 0.7362682 -0.26470375
#> 0610010K14Rik 0.0531026 0.2425516 2.0849571 -0.03928909
#> ds_is_on_count downsample_n
#> 0610009B22Rik 96 750
#> 0610009E02Rik 4 750
#> 0610009L18Rik 4 750
#> 0610009O20Rik 11 750
#> 0610010F05Rik 8 750
#> 0610010K14Rik 35 750
print(quantile(gstat$ds_vm_norm,c(1:20)/20))
#> 5% 10% 15% 20% 25% 30%
#> -1.01951310 -0.71614349 -0.53383802 -0.41735351 -0.30897845 -0.22398068
#> 35% 40% 45% 50% 55% 60%
#> -0.17109455 -0.09143406 0.00000000 0.00000000 0.00000000 0.11264763
#> 65% 70% 75% 80% 85% 90%
#> 0.20021039 0.28101798 0.36465806 0.49567051 0.64412965 0.83955549
#> 95% 100%
#> 1.15266282 3.58906673
t_vm = quantile(gstat$ds_vm_norm,c(1:20)/20)[17]
png("results/hist.gstats.t_vm.png")
hist(gstat$ds_vm_norm,breaks=c(floor(min(gstat$ds_vm_norm)*10):ceiling(max(gstat$ds_vm_norm)*10))/10,main = "Histogram of downsampled variance divided by mean")
abline(v=t_vm)
axis(side = 1,at=t_vm,labels=t_vm)
dev.off()
#> png
#> 2
Histogram of ds_vm_norm
Selecting a gene set for downstream analysis:
We create a new object of type gset (gene set), to which all genes whose scaled variance (variance divided by mean, AKA ds_vm_norm) exceeds a given threshold are added.
The command creates a new gene set with all genes for which the scaled variance is higher than 0.6441296, it also restricts this gene set to genes with at least 50 UMIs across the entire dataset, and also requires selected genes to have at least three cells for more than 4 UMIs were recorded.
mcell_gset_filter_multi(gstat_id=id, gset_id=id, T_tot=50, T_top3=4, T_vm = t_vm, force_new = T)
#> Selected 737 markers
We can refine our parameters by plotting all genes and our selected gene set given the mean and variance statistics:
mcell_plot_gstats(gstat_id=id, gset_id=id)
#> png
#> 2
var mean plot
gset_unfiltered = scdb_gset(id)
rp_markers = grep("Rpl|Rps|mt-", names(gset_unfiltered@gene_set), v=T)
markers_table = read.delim("annotations/markers.txt")$gene
markers_to_keep = setdiff(names(gset_unfiltered@gene_set), unique(c(bad_genes,rp_markers)))
scdb_del_gset(id)
#> [1] TRUE
tmp = rep(1, length(markers_to_keep)); names(tmp) = markers_to_keep
new_gset = gset_new_gset(tmp, "filtered gset")
scdb_add_gset(id, new_gset)
downsample_n = scm_which_downsamp_n(mat)
mat_ds = scm_downsamp(mat@mat, downsample_n)
gene_gene_cors = cor(t(as.matrix(mat_ds[names(new_gset@gene_set),])),method = "spearman");
n_genes = length(names(new_gset@gene_set))
cs_top2 = matrixStats::rowOrderStats(gene_gene_cors, which = n_genes-1)
cs_top3 = matrixStats::rowOrderStats(gene_gene_cors, which = n_genes-2)
cs_top4 = matrixStats::rowOrderStats(gene_gene_cors, which = n_genes-3)
markers_to_keep = names(new_gset@gene_set[cs_top3>0.1]);
scdb_del_gset(id)
#> [1] TRUE
tmp = rep(1, length(markers_to_keep)); names(tmp) = markers_to_keep
new_gset = gset_new_gset(tmp, "filter low gene gene correlation")
scdb_add_gset(id, new_gset)
Assuming we are happy with the selected genes, we will move forward to create a similarity graph (cgraph), using a construction called balanced K-nn graph:
set.seed(27)
mcell_add_cgraph_from_mat_bknn(mat_id=id,gset_id = id,graph_id=id,K=100,dsamp=T)
#> will downsample the matrix, N= 750
#> will build balanced knn graph on 1884 cells and 639 genes, this can be a bit heavy for >20,000 cells
This adds to the database a new cgraph object named cgraph.reseq.
The K=100 parameter is important, as it affects the size distribution of the derived metacells.
The next step will use the cgraph to sample five hundred metacell partitions, each covering 75% of the cells and organizing them in dense subgraphs:
set.seed(27)
mcell_coclust_from_graph_resamp(coc_id=id,graph_id=id,min_mc_size=15,p_resamp=0.75, n_resamp=500)
#> running bootstrap to generate cocluster
#> 0%...27%...38%...50%...60%...71%...82%...93%...100%
#> done resampling
The metacell size distribution of the resampled partitions will be largely determined by the K parameter used for computing the cgraph.
The resampling process may take a while if the graphs are very large. You can modify n_resamp to generate fewer resamples.
The resampling procedure creates a new coclust object in the database named coclust.reseq, and stores the number of times each pair of cells ended up being part of the same metacell (the cnt column).
The co-clustering statistics are used to generate a new similarity graph, based on which accurate calling of the final set of metacells is done:
set.seed(27)
mcell_mc_from_coclust_balanced(coc_id=id,mat_id= id,mc_id= id,K=30, min_mc_size=15, alpha=2)
#> filtered 376283 left with 107817 based on co-cluster imbalance
#> building metacell object, #mc 21
#> add batch counts
#> compute footprints
#> compute absolute ps
#> compute coverage ps
#> reordering metacells by hclust and most variable two markers
#> reorder on Aldoc vs Trf
We created a metacell object mc.reseq based on analysis of the co-clustering graph.
The parameter K determines the number of neighbors we wish to minimally associate with each cell.
Prior to partitioning the co-cluster graph is filtered to eliminate highly unbalanced edges, with smaller alpha resulting in harsher filtering.
We will first assign random colors to our clusters (these can later be modified with custom color definitions, e.g. based on cell type assignments).
mc<- scdb_mc(id)
mc_id = id
no_color_mc = paste0("no_color_",id)
mc@colors <- colorRampPalette(c("darkgray", "burlywood1", "chocolate4","orange", "red", "purple", "blue","darkgoldenrod3", "cyan"))(ncol(mc@mc_fp))
scdb_add_mc(id = no_color_mc,mc = mc)
scdb_add_mc(id = id,mc = mc)
mc<- scdb_mc(no_color_mc)
The metacell object mc.no_color_reseq can now be visualized.
In order to do this effectively, we usually go through one or two iterations of selecting informative marker genes.
The package can select markers for you automatically - by simply looking for genes that are strongly enriched in any of the metacells:
mcell_gset_from_mc_markers(gset_id=paste0(mc_id,"_markers"), mc_id=no_color_mc)
mcell_mc_plot_marks(mc_id=no_color_mc, gset_id=paste0(mc_id,"_markers"), mat_id=id,plot_cells = T)
mcell_mc_plot_marks(mc_id=no_color_mc, gset_id=paste0(mc_id,"_markers"), mat_id=id,plot_cells = F)
heatmap_cells_marks_mc
heatmap_marks_mc
We can take a look on the distribution of gene markers (requires a prior literature review), and generate a colorize table.
Assume we have a marker genes table like that:
| name | gene | T_fold |
|---|---|---|
| NA | Npy | 2 |
| NA | Fabp7 | 2 |
| NA | Apoe | 2 |
| NA | Malat1 | 2 |
| NA | Sec62 | 2 |
| NA | Mt2 | 2 |
The values plotted are color coded log2(fold enrichment) and fold enrichment values of the metacell over the median of all other metacells:
mc = scdb_mc(no_color_mc)
marks = as.character(read.delim(paste0("results/",no_color_mc,".cells_heat_marks_gene_names.txt"))[[1]])
gene_markers = markers_table[,"gene"]
nx = ceiling(length(gene_markers)/3)
ny = 3
lfp <- log2(mc@mc_fp)
png("results/genes_log_distribution.png",w=6000,h=(6000/nx)*ny*3)
layout(matrix(1:(nx*ny), nx , ny , byrow=T))
for(gene in gene_markers){
par(mar=c(5,4,5,4),xpd=TRUE)
barplot(lfp[gene,],col=mc@colors,las=2,cex.axis=2,ylab="log2FC",xlab="metacells")
if(gene %in% markers_table$gene && !is.na(markers_table[markers_table$gene == gene,"T_fold"])){
abline(h=log2(markers_table[markers_table$gene == gene,"T_fold"]))
}
title(main = gene,cex.main=3.6)
}
dev.off()
#> png
#> 2
Genes log distribution
#> png
#> 2
Genes distribution
While 2D projections are popular and intuitive (albeit sometimes misleading) ways to visualize scRNA-seq results, we can also summarize the similarity structure among metacells using a “confusion matrix” which encodes the pairwise similarities between all metacells.
This matrix may capture hierarchical structures or other complex organizations among metacells.
We first create a hierarchical clustering of metacells, based on the number of similarity relations between their cells:
set.seed(27)
mc_hc = mcell_mc_hclust_confu(mc_id=no_color_mc,graph_id=id)
Next, we generate clusters of metacells based on this hierarchy, and visualize the confusion matrix and these clusters.
The confusion matrix is shown at the bottom, and the top panel encodes the cluster hierarchy (subtrees in blue, sibling subtrees in gray):
set.seed(27)
mc_sup = mcell_mc_hierarchy(mc_id=no_color_mc,mc_hc=mc_hc, T_gap=0.04)
save(file="saved_work/mc_hc_sup.Rda",mc_hc,mc_sup)
source(src_functions)
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
local_mcell_mc_plot_hierarchy(mc_id=no_color_mc,graph_id=id,mc_order=mc_hc$order,sup_mc = mc_sup,width=3500, height=3500, min_nmc=2)
#> png
#> 2
confusion matrix
sup_col = data.frame(supid=c(),color=c(),name=c())
sup_col = rbind(sup_col,data.frame(supid=c(9),color=c("gray"),name=c("Others")))
sup_col = rbind(sup_col,data.frame(supid=c(1),color=c("purple"),name=c("Oligodendrocytes")))
sup_col = rbind(sup_col,data.frame(supid=c(10),color=c("green"),name=c("Astrocytes")))
write.table(sup_col,file = "annotations/sup_col.txt",row.names = FALSE,sep = "\t")
gene_key = data.frame(name = c(), gene =c(), color = c(), T_fold = c())
gene_key = rbind(gene_key,data.frame(name = c("Endothelial"), gene =c("Bsg"), color = c("#AB7737"), T_fold = c(4.8)))
gene_key = rbind(gene_key,data.frame(name = c("Leptomeningeal"), gene =c("Dcn"), color = c("#e5e500"), T_fold = c(4)))
gene_key = rbind(gene_key,data.frame(name = c("Pericytes"), gene =c("Vtn"), color = c("#add8e6"), T_fold = c(6)))
gene_key = rbind(gene_key,data.frame(name = c("Ependymal"), gene =c("Tmem212"), color = c("#FF4500"), T_fold = c(4.5)))
write.table(gene_key, file="annotations/gene_key.txt",row.names = FALSE,sep = "\t")
mc_colorize_sup_hierarchy(mc_id = mc_id,supmc = mc_sup,supmc_key = "annotations/sup_col.txt",gene_key = "annotations/gene_key.txt")
local_mcell_mc_plot_hierarchy(mc_id=mc_id,graph_id=id,mc_order=mc_hc$order,sup_mc = mc_sup,width=4500, height=4500, min_nmc=2)
#> png
#> 2
confusion matrix
We may want to visualize the similarity structure among metacells (or among cells within metacells).
We construct a 2D projection of the metacells, and use it to plot the metacells and key similarities between them (shown as connecting edges), as well as the cells. This plot will use the same metacell coloring we established before.
set.seed(27)
source(src_functions)
mcell_mc2d_force_knn(mc2d_id= mc_id,mc_id=mc_id, graph_id=id)
#> comp mc graph using the graph reseq and K 20
mc2d = scdb_mc2d(id)
mc = scdb_mc(id)
new_mc_y = mc2d@mc_y
new_mc_y[16] = new_mc_y[16] - 20
xy_df = data.frame(mc_x = mc2d@mc_x, mc_y = new_mc_y)
mcell_mc2d_force_knn_on_cells(mc2d_id = mc_id,mc_id = id,graph_id = id,mc_xy = xy_df)
#> comp mc graph using the graph reseq and K 20
tgconfig::set_param("mcell_mc2d_height",2200, "metacell")
tgconfig::set_param("mcell_mc2d_width",2300, "metacell")
tgconfig::set_param("mcell_mc2d_legend_cex",3,"metacell")
local_mcell_mc2d_plot(mc2d_id=mc_id,plot_edges = T,marg = c(1,22,5,0),inset = c(-0.26,-0.05))
#> png
#> 2
Note that we changed the metacell parameters “mcell_mc2d_height/width” to get a reasonably-sized figure.
We obtain the following figure:
We can use the colors to produce a labeled heat map, showing selected genes and their distributions over metacells, with the colored annotation shown at the bottom:
mcell_mc_plot_marks(mc_id=mc_id, gset_id=paste0(id,"_markers"), mat_id=id,plot_cells=T)
tgconfig::set_param("mcp_heatmap_text_cex",3, "metacell")
tgconfig::set_param("mcp_heatmap_height",2600, "metacell")
mcell_mc_plot_marks(mc_id=mc_id, gset_id=paste0(id,"_markers"), mat_id=id,plot_cells=F)
heatmap_cells_marks_after_colorizing
heatmap_marks_after_colorizing
# generates the excel file of clusters and genes
mcell_mc_export_tab(mc_id=mc_id,gstat_id=id,mat_id=id,T_gene_tot=100, metadata_fields=c("amp_batch_id"))
The first 3 capture plates created by a kit for brain dissocation, the last 3 capture plates created by our new protocol
We would like to compare between both protocols
mc = scdb_mc(mc_id)
color_key = unique(mc@color_key[,c("group","color")])
name2color = color_key$color
names(name2color) = color_key$group
mat = scdb_mat(id)
df = data.frame(Well_ID = names(mc@mc),mc=as.integer(mc@mc),
Amp.Batch.ID = as.character(mat@cell_metadata[names(mc@mc),"amp_batch_id"]))
group_1_batches = c("AB1584","AB1585","AB1586")
group_1_name = "Kit"
group_2_name = "New protocol"
df$Group = ifelse(df$Amp.Batch.ID %in% group_1_batches,group_1_name,group_2_name)
df$mc = factor(df$mc, levels = sort(unique(df$mc)))
png("results/barplot_amp_batches.png",width = 1500,height = 1500)
par(mar = c(1,5,2,0), fig=c(0,0.8,0.05,1),cex=2)
barplot(table(df$Group,df$mc),col = c("#ff9500","#51a242"),axes = FALSE,ylab = "Cell count")
axis(2)
par(mar = c(0,5,1,0),fig=c(0,0.8,0,0.05),new=T)
barplot(rep(1,length(mc@colors)),col = mc@colors[as.integer(colnames(table(df$Group,df$mc)))],axes=F)
par(mar = c(0,1,2,0),fig=c(0.8,1,0.05,0.3),new=T)
plot.new()
legend("topleft", legend = names(name2color), pch = 19,col = as.character(name2color), bty = "n")
par(mar = c(0,1,2,0),fig=c(0.8,1,0.3,0.4),new=T)
plot.new()
legend("topleft", legend = c("Kit","New protocol"), pch = 19,col = c("#ff9500","#51a242"), bty = "n")
dev.off()
#> png
#> 2
Protoclos distribution over metacells